Standardized test scores (STS) should be used on a local level: (1) as
one component of evaluation of a student, school, or district; (2) to
draw as much interpretive meaning from a norm-referenced test (NRT) as
its structure will support; (3) as a communication device with students,
parents, the public, and professional staff; (4) to check status across
grade levels in key subject matter areas; (5) to compare inter- and
intra-grade scores across years; (6) to discover individuals whose
measured achievement deviates significantly from their school ability
level; and (7) to compare achievement scores with national, local, and
other subgroup normative data sets. The misuse of STS is outlined, as
are the several weaknesses or limitations of using "local norms" based
only on the district population. Local norms often depend on unqualified
local personnel for their interpretation, should be developed once and
then re-used over the next few years in order to track year-to-year
changes, and may lead to educationally damaging interpretations in
low-performing districts. (RL)

Standardized tests have a large number of uses and an even larger number
of misuses on a local level. The following uses and misuses strike me as
being especially important - or, at least, interesting to discuss.

It is important to stress at the start that I do not pretend to speak for
my colleagues, either within The Psychological Corporation or in the
"industry" in general. These are merely my biases; others should feel
free to disagree, if they choose to be wrong.

The uses and misuses I wish to address can be conveniently grouped into
three major groups - testing's proper role in decision making, using tests
to assess status and change, and interpreting results in terms of various
frames of reference.

I. TEST'S PROPER ROLE

Use: As one component of evaluation of a student, school, or district.

Misuse: As the sole criterion for decision-making: e.g., minimum
competency tests, promotion decisions, Title I evaluations, teacher
evaluation.

No important decision should ever be made based on a single piece of
information, no matter how "reliable and valid" that information is. Yet,
critical educational decisions are made daily about children, teachers,
programs, and schools based solely on single isolated sets of test scores.
No doubt, part of the blame for this rests with our enamor for numbers.
Yet, that is a simplistic and incomplete explanation. I believe a larger
part of the problem here is a sad reality - we (collectively) have simply
failed to use anything other than norm-referenced tests to evaluate
schools or kids. How can we assail newspaper reporters, boards of
education, legislators, or "the public" for overinterpreting test scores
when we provide no other evidence? Isn't it time we all got serious about
seeking out sound alternative assessment devices and using them in
addition to, if not instead of, standardized tests? If we all honestly
believe that a 40-minute reading test provides a less complete answer to
our questions "Can Jill read?" or "Did the program work?" than does a
trained professional, let's all say so and provide the other evidence.
When we confront untrained people with a single piece of data, especially
one with decimalized numbers, is it really surprising that judgments are
based solely on that one piece of data?

Paper presented at the meeting of the American Educational Research
Association, Los Angeles, April 1981.

All of that said, I'd like also to mention that one piece of data is at
least a step in the right direction. The NEA ostrich position provides
little solace to a disgruntled, skeptical public. NEA's argument is
perceived - correctly, in my unbiased eyes - as being essentially: "One
piece of information isn't sufficient for decision-making, especially when
the information is far from perfect. Therefore, let's not provide any
information and there won't be any problems." Fortunately, few people -
even among NEA's own membership - have been able to follow the logic any
better than I. However, anyone who believes educational accountability is
passé just because it's the topic of only one AERA session this year has
simply not been listening to people in the real world.

Use: Drawing as much interpretive meaning from an NRT as its structure
will support.

Misuse: Pretending NRTs are "diagnostic" and criterion-referenced.

Regardless of what publishers' advertisements and promotion brochures
claim, or what all of us would like, one test cannot be all things to all
people. The test with greatest market appeal today would be a
twenty-minute, diagnostic/prescriptive basic skills achievement battery -
with objective mastery cutoffs and, of course, a full complement of
normative data. Sooner or later, we - all of us in this game - are going
to have to be honest and tell some people we don't have such an instrument
and we never will.

There is a fine, and often indistinct, line between milking a test for
as much information as it can reasonably yield and overinterpretation. My
personal bias is that all of us have, more often than we like to admit,
crossed the line.

Use: As a communication device - with students, parents, "the public,"
and professional staff.

Misuse: Failure to report results to all concerned groups.

A recent national survey (Beck & Stetz, 1979) indicated that while almost
90% of the students in Grades 5-12 would like to find out their scores on
standardized tests, fewer than 1/3 of their teachers report the scores to
students. Is it surprising when we hear of students whose attitudes toward
taking such tests are less-than-ideal or who give less than 100% to
completion of the task? If you were told yearly that the test you were
taking was important, would be used to help you, and that you should do
your best, how seriously would you take this information after being lied
to five or seven times? On a broader level, if results are not routinely
shared with parents, the public, and the staff, shouldn't we expect
results to be viewed with suspicion or distrust or as not useful?

II. STATUS & CHANGE

Use: Checking status across grade levels in key subject matter areas.
Comparing inter- and intra-grade scores across years.

Misuse: Smorgasbord testing programs. Frequent changes in test series.

Probably the single most useful attribute of standardized, norm-referenced
test scores is their comparability within grade across content areas and
within content areas across grades. That is, such scores - probably
uniquely - permit schools to assess (in a normative sense at least)
relative status across subject matter areas and across grades.

Nevertheless, a disturbingly large percentage of school districts are
unable to avail themselves of this feature. Why? For whatever reason -
budget, poor leadership, a committee approach to decision-making, not
wanting to disappoint any publisher - sizable numbers of districts use 2,
3, or more series at different grades in any given year. Such a testing
policy, by definition, eliminates one of the potentially most useful
features of any standardized test. When such a policy exists, the school
either is unable to look at changes across grades/content areas or, worse,
"looks" at them but draws unsupportable, totally inaccurate conclusions.
At the risk of slight overstatement, I doubt that there is ever an
educationally sound justification for such a smorgasbord testing plan.

A related, though perhaps less pervasive, problem is that of changing a
systemwide test series on a frequent basis. "Frequent" is difficult to
define, though for discussion purposes, I would be hard pressed to support
changes more frequent than every four or five years. Even when changes are
made from an old to a newer edition of a test series, the problem is
present, despite what publishers' equating tables would indicate. For many
test uses, the consistency of the norms is of greater importance than
their accuracy vis-a-vis some theoretical national average. Each time a
test series is changed, this consistency is lost. The interpretive value
of an unchanging frame of reference is often overlooked in the search for
the most up-to-date test and test norms possible. In most instances,
consistency of the norms has much greater import than does currency.

Use: Assessing change (individual or group) over time.

Misuse: Unrealistic "growth" expectancies.

One of the primary reasons schools use NRTs is to assess change. In fact,
perhaps the broadest "use" of such tests today is in Title I and other
compensatory programs, in which change assessment is the primary, if not
sole, purpose. Nevertheless, there continue to be large numbers of
districts in which growth expectancies are totally unreasonable.

Examples of this situation are not difficult to find. They include the
following, each of which occurs far more frequently than any of us -
publishers, "informed" users, evaluators, ivory-tower academicians - would
like to admit:

   "All students in this program should show 5 NCE units' growth."
   (This is the 1980's version of the "year's growth for a year's
   instruction" slogan. Hard as this is for our DOE, RMC, TAC and other
   alphabet friends to accept, the current slogan is only marginally
   more digestible than its distasteful predecessor.)

   "The average PR for each of our elementary buildings will increase
   by 10 points this year."

   "Every child (building) should score above average on this test."

   "... of our students failed to show normal growth this year."

III. FRAMES OF REFERENCE

Use: Comparing ability and achievement test results to discover
individuals whose measured achievement deviates significantly from
their school ability level.

Misuse: Interpreting small, non-significant ability-achievement
differences as revealing problems. Comparing results on ability and
achievement tests that were not normed together. Considering ability test
results as indicating innate, immutable intelligence.

Interpreted with the appropriate amount of caution, analysis of
significant differences between concurrently normed ability and
achievement tests can be revealing and instructionally useful. The key
portions of this position are "with the appropriate amount of caution,"
"significant differences," and "concurrently normed." If any of these are
not met in a specific instance, the value of the comparisons will range
from meaningless to harmful. A subtle but, I believe, meaningful
distinction in ability-achievement comparisons is between interpreting the
results in an "expectancy" sense versus in a "predictive" sense.

Many of my more academically oriented colleagues continue to prolong
the age-old debate of whether intelligence/ability tests actually measure
anything distinct from achievement tests. It's really time to put this
silly topic to rest - of course they measure different things. Not totally
different, not uncorrelated, not two sets of unique characteristics, but
clearly different things. Of course how able someone is relates slightly
with his current achievement. And of course what someone has learned to
date affects how able she is to learn future things. But to pretend that
ability and achievement are one and the same, or that even current
state-of-the-art measures assess the "same thing" despite their labels, is
patently false and inattentive to facts.

Looked at purely in a statistical sense - which I guess AERA meeting
presenters should do - assume a typical pair of achievement and ability
tests. The ability test would have a reliability coefficient of about .90
- most are better, but I work better with rounded numbers. All of us
technicians visualize a pie called intelligence or ability or some such in
which 90% is "clean" (whatever it is we're really measuring) and the other
10% is garbage - "error" to purists. Now let's add the achievement test.
The typical achievement-ability test correlation is about .75. Using my
psychometric snake oil, I come up with a new ability-achievement pie in
which we still have the 10% garbage, a 56% slice called
achievement-ability, and 34% that's unique to the ability test. No one can
honestly claim that something that accounts for over a third of the pie is
trivial. Not something we want to pay attention to for whatever reason -
time, politics, cost - OK. Not "worth it" in a measurement sense -
short-sighted.
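The arithmetic behind the pie can be sketched in a few lines. The paper does not spell out its method, so this assumes the usual reading: the squared correlation is the share of ability-test variance held in common with the achievement test, one minus the reliability is error, and the remainder of the reliable variance is unique to the ability test. The function name and dictionary keys are illustrative only.

```python
def variance_pie(reliability, correlation):
    """Split ability-test variance into shared, unique, and error shares.

    Assumption (not stated in the paper): shared variance is taken to be
    the squared ability-achievement correlation.
    """
    shared = correlation ** 2          # overlap with the achievement test
    error = 1.0 - reliability          # the "garbage" slice
    unique = reliability - shared      # reliable variance not shared
    return {"shared": shared, "unique_to_ability": unique, "error": error}


# The rounded figures used in the text: reliability .90, correlation .75.
pie = variance_pie(reliability=0.90, correlation=0.75)
for name, share in pie.items():
    print(f"{name}: {share:.0%}")
```

With the rounded inputs from the text this reproduces the 56%, 34%, and 10% slices; a higher reliability or a lower correlation enlarges the slice unique to the ability test.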

Use: Comparing achievement scores with appropriate benchmarks - national,
local, other subgroup normative data sets.

Misuse: Selecting benchmarks that result in misleading conclusions about
status or change.

National norms, despite what the popular press would have us believe,
are not on their way out. Such data continue to be widely requested and
almost as widely used. I see no signs of the demise of national norms for
the traditional types of survey tests.

On the other hand, other types of normative data are frequently requested
and, far less frequently, provided for certain NRTs. I'm thinking here of
such subgroup data as regional, public vs. non-public, large-city, Title I,
special education, and socioeconomic status norms. Many, if not most, test
users would like to have some type of sub-national norms for the tests they
are using. Many of the norms sets currently being provided for such
purposes are remarkably devoid of technical soundness, however. Potential
users of such norms sets need to inspect the representativeness of these
data very carefully rather than faithfully adopting the data as if they
were sound and well-developed.

A final set of popular norms for tests is "local norms," based only on
the district population. Despite their surprisingly wide use, and sanction
by most measurement specialists, these data have several weaknesses or
limitations:

1) It is extremely rare to find local personnel who can correctly
   interpret such data. I have repeatedly heard such things as,
   "National norms tell how we compare to people nationally and
   local norms compare us with similar local districts." Or,
   "Our district average is at the 43rd percentile in national norms,
   but we're right at the 50th in terms of local norms." Or,
   "Our averages in national norms have been dropping over the past
   few years, but with local norms, we're holding our own." Such
   statements give me little reason to suspect that local norms are
   interpreted nearly as well as are national norms. And we all
   know how well national norms are interpreted.

2) In order to be most useful, local norms should be developed once and
   then re-used over the next few years. Otherwise, year-to-year
   changes cannot be tracked. However, I know of no large-scale
   development and use of local norms of this type.

3) In low-performing districts, local norms can lead to educationally
   damaging interpretations. The real message often conveyed
   in such cases is, "Mrs. Jones, your son can't read, but he can
   read almost as well as the other kids who can't read." This is,
   of course, not a problem with local norms per se, but with their
   interpretations.
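A small numerical sketch may make the first two points concrete. All scores below are invented for illustration, and `percentile_rank` is a hypothetical helper, not a publisher's norming procedure. The sketch shows that a district's median pupil sits at the 50th percentile of local norms by construction, and that local norms rebuilt every year hide a decline that frozen local norms would reveal.

```python
from bisect import bisect_left, bisect_right
from statistics import median


def percentile_rank(score, norm_group):
    """Percent of norm_group scoring below `score`, counting half of ties."""
    ordered = sorted(norm_group)
    below = bisect_left(ordered, score)
    ties = bisect_right(ordered, score) - below
    return 100.0 * (below + 0.5 * ties) / len(ordered)


# Hypothetical raw scores, invented purely for illustration.
national_sample = list(range(20, 81))              # stand-in national norm group
district_year1 = [30, 35, 40, 45, 50, 55, 60]
district_year2 = [s - 5 for s in district_year1]   # every pupil drops 5 points

for label, district in (("year 1", district_year1), ("year 2", district_year2)):
    mid = median(district)
    national_pr = percentile_rank(mid, national_sample)
    rebuilt_pr = percentile_rank(mid, district)        # local norms redone each year
    frozen_pr = percentile_rank(mid, district_year1)   # local norms fixed at year 1
    print(f"{label}: national {national_pr:.0f}, "
          f"rebuilt local {rebuilt_pr:.0f}, frozen local {frozen_pr:.0f}")
```

Under these made-up numbers the rebuilt local percentile stays at 50 in both years, which is the "holding our own" statement quoted above, while both the national and the frozen local percentiles fall.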